So far in this class, we’ve only done supervised learning: we have a response variable, and we observe its value for all (or some) of our observations.
Clustering is a type of unsupervised learning: we want to sort our observations into clusters based on the predictors, but we don’t have a preconceived notion of what those clusters represent!
The general goal of clustering is to make clusters such that points within a cluster are closer to each other than to the points outside the cluster.
What is our definition of close?
How many clusters do we think exist?
What algorithm do we use to select the clusters?
Idea: Iteratively update the centers of the clusters until convergence.
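The two-step loop (assign points to the nearest centroid, then move each centroid to the mean of its points) can be sketched from scratch in R. This is a minimal illustration on toy data, not the actual `kmeans()` implementation; all object names here are made up.

```r
set.seed(1)
# toy data: two well-separated groups in 2D
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
k <- 2
centers <- x[sample(nrow(x), k), ]  # initialize: k randomly chosen observations

for (iter in 1:20) {
  # assignment step: each point goes to its nearest centroid
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # update step: each centroid moves to the mean of its assigned points
  new_centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  if (all(abs(new_centers - centers) < 1e-8)) break  # converged: centroids stopped moving
  centers <- new_centers
}
```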
Choosing Initial Centroids
There are many ways to choose initial centroids: points placed randomly in space, randomly chosen observations from the dataset, or points chosen to be as far apart as possible.
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
The kmeans() function must be given a numeric matrix or data frame and a number of clusters.
It returns the cluster centroids, the cluster assignment for each observation, and the within- and between-cluster sums of squares.
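For example, on a small toy matrix (the object names here are illustrative, not from the slides):

```r
set.seed(42)
# toy data: two groups of 25 points in 2D
mat <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 4), ncol = 2))

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best
km <- kmeans(mat, centers = 2, nstart = 10)

km$centers        # one row of centroid coordinates per cluster
km$cluster        # cluster assignment (1 or 2) for each row of mat
km$tot.withinss   # total within-cluster sum of squares
```

Setting `nstart` greater than 1 is a cheap way to reduce sensitivity to the random initial centroids.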
Cluster centers (mean relative frequency of each word, by cluster):
a do is or this all down it
1 0.02264 0.0004926 0.011389 0.006299 0.007694 0.003692 0.00006095 0.01290
2 0.01133 0.0005557 0.006617 0.011255 0.004472 0.002808 0.00000000 0.01659
3 0.01991 0.0003927 0.011035 0.006990 0.007490 0.003988 0.00020684 0.01388
our to also even its shall up an
1 0.0020956 0.03882 0.0003545 0.0009570 0.003617 0.001507 0.00037792 0.005632
2 0.0044983 0.03462 0.0013692 0.0005221 0.002432 0.001175 0.00000000 0.002015
3 0.0007316 0.03991 0.0004887 0.0008273 0.003761 0.001594 0.00006386 0.005143
every may should upon and for. more so
1 0.0018063 0.004356 0.001963 0.0027342 0.02559 0.006711 0.002886 0.002195
2 0.0004988 0.003866 0.002959 0.0001198 0.04885 0.006796 0.005968 0.003247
3 0.0015005 0.004252 0.002164 0.0024354 0.02481 0.006119 0.002760 0.001935
was any from must some were are had
1 0.001539 0.003022 0.005779 0.002555 0.0014136 0.001507 0.005327 0.001454
2 0.001699 0.002560 0.006205 0.001444 0.0015569 0.001991 0.005777 0.001105
3 0.001450 0.002925 0.004930 0.001991 0.0009518 0.001211 0.004672 0.001243
my such what as has no than when
1 0.0003375 0.002146 0.001287 0.008645 0.003715 0.002593 0.003044 0.0010076
2 0.0003618 0.003601 0.001148 0.011498 0.001965 0.001162 0.004317 0.0016872
3 0.0001322 0.002073 0.001191 0.008948 0.002657 0.002629 0.002846 0.0008404
at have not that which be her now
1 0.003359 0.007324 0.006319 0.01493 0.011130 0.01995 0.0008013 0.0004449
2 0.002607 0.005895 0.007625 0.01751 0.006758 0.01880 0.0010409 0.0004457
3 0.002745 0.006013 0.006159 0.01470 0.010791 0.02168 0.0001255 0.0004326
the who been his of their will but
1 0.08707 0.002502 0.004712 0.0021895 0.06232 0.005920 0.006149 0.003631
2 0.06104 0.003742 0.001825 0.0006076 0.04376 0.009810 0.008446 0.004781
3 0.10641 0.001867 0.003792 0.0017050 0.06602 0.004578 0.006730 0.003198
if. on then with by in. one there
1 0.003052 0.004274 0.0003348 0.005290 0.008641 0.02383 0.002618 0.002865
2 0.004054 0.005086 0.0005555 0.007253 0.009479 0.01921 0.005726 0.001024
3 0.003313 0.004203 0.0004346 0.005784 0.007441 0.02323 0.002792 0.002621
would can into only things your
1 0.007405 0.002344 0.001717 0.001319 0.00024624 0.0002759
2 0.009061 0.002233 0.003068 0.003100 0.00008643 0.0004461
3 0.007842 0.002768 0.001345 0.001586 0.00018331 0.0000000
Did we really need all 200 variables to find those clusters?
Did we maybe “muddy the waters” by weighting all variables equally?
It is very common to do a PCA reduction before running k-means!
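A sketch of that workflow: run `prcomp()`, keep the first few principal component scores, and cluster those instead of the raw variables. The toy matrix and the number of PCs kept here are placeholders for your own choices.

```r
set.seed(1)
mat <- matrix(rnorm(200), ncol = 4)   # placeholder for your numeric data

pc <- prcomp(mat, center = TRUE, scale. = TRUE)  # standardize, then PCA
scores <- pc$x[, 1:2]                 # keep only the first two PCs
km <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster)                     # cluster sizes
```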
# A tibble: 7 × 3
clust auth n
<int> <chr> <int>
1 1 AH 26
2 1 JM 1
3 2 AH 3
4 2 JJ 1
5 2 JM 13
6 3 JJ 4
7 4 AH 22
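A cross-tabulation like the one above can be built by pairing the cluster assignments with the known labels. The `clust` and `auth` vectors below are toy stand-ins; with real data you would use `km$cluster` and your author labels.

```r
set.seed(3)
# toy stand-ins for km$cluster and the author labels
clust <- sample(1:3, 30, replace = TRUE)
auth  <- sample(c("AH", "JM", "JJ"), 30, replace = TRUE)

# count how many observations from each author landed in each cluster
table(clust, auth)
```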
Pros:
Simple algorithm, easy to understand
Plays nice with PCA
SUPER fast to compute
Cons:
Very sensitive to location of initial centroids
User has to pick the number of clusters
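One standard heuristic for the second con (not shown in the slides) is an “elbow” plot: run `kmeans()` for a range of k and look for where the total within-cluster sum of squares stops dropping sharply. The toy data below has three true groups.

```r
set.seed(1)
mat <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 5),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# total within-cluster SS for k = 1, ..., 8
wss <- sapply(1:8, function(k) kmeans(mat, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
# look for the "elbow" where adding more clusters stops helping much
```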
Open Activity-Clustering.Rmd
Now, refer back to your PCA analysis of the cannabis data (from Monday).
(also called agglomerative clustering)
Idea: Merge observations that are close together.
The hclust() function needs to be given a distance matrix.
When you give hclust() a distance matrix, it returns a tree of merges that can be plotted as a dendrogram.
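A minimal sketch on toy data (object names are illustrative): build the distance matrix with `dist()`, hand it to `hclust()`, and plot the dendrogram.

```r
set.seed(1)
# toy data: two groups of 10 points in 2D
mat <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

d  <- dist(scale(mat))  # Euclidean distances on standardized variables
hc <- hclust(d)         # default linkage is "complete"
plot(hc)                # dendrogram of the merges
```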
To decide how to assign clusters, we can choose how many clusters we want…
[1] 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1
[39] 1 3 3 3 3 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Or we can choose a height cutoff for the dendrogram
[1] 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 3 3 3 1 3 1 1 1 3 3 3 1 1 3 1 1 1 1 1 1 4 1 1
[39] 3 4 4 4 4 4 1 1 1 1 2 3 3 3 1 3 3 1 1 1 3 3 1 3 1 1 3 1 3 1 3 3
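Both ways of extracting cluster labels go through `cutree()`: the `k` argument asks for an exact number of clusters, while `h` cuts the dendrogram at a given height. A self-contained sketch on toy data (the height 4 is arbitrary):

```r
set.seed(1)
mat <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
hc <- hclust(dist(mat))

cutree(hc, k = 3)   # ask for exactly 3 clusters
cutree(hc, h = 4)   # or cut the dendrogram at height 4
```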
Pros:
Fast computation for moderate sized data
Gives back full information in dendrogram form
Cons:
Computing the full distance matrix is expensive for large datasets
Merges are greedy and can never be undone
Results depend on the choice of linkage (complete, average, single, …)
Open Activity-Clustering.qmd
1. Apply hierarchical clustering to the cannabis data.
2. Compare your results to k-means.
3. Which method do you prefer? Why?